For financial organizations, building accurate credit scoring models has become a top priority. Advances in big data technology have transformed the financial sector, ushering in a new era of personal credit assessment. In this project, I'll predict customer default using machine learning.

Table of Contents:¶

  1. Basic descriptive statistics
  2. EDA - Exploratory Data Analysis
    1. Univariate analysis
    2. Multivariate analysis
  3. Feature Transformation and Train/Test Split
  4. Train and Validate The Model
  5. Model explanation, performance and strategy
  6. Conclusion

About the dataset¶

The target is the variable default.

The data has the following structure:

  • Observation_id: unique observation id.
  • Checking_balance: Status of the existing checking account (German currency).
  • Savings_balance: Savings account/bonds (German currency).
  • Installment_rate: Installment rate in percentage of disposable income.
  • Personal_status: Personal status and sex.
  • Residence_history: Present residence since.
  • Installment_plan: Other instalment plans.
  • Existing_credits: Number of existing credits at this bank.
  • Dependents: Number of people being liable to provide maintenance for.
  • Default: 0 is a good loan, 1 is a defaulting one.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import math
import seaborn as sns
import plotly.express as px
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn import metrics
import shap
import kds

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

1. Basic descriptive statistics¶

This part examines the fundamental descriptive statistics to understand the data's shape, anomalous values, and columns that require reformatting or transformation.

In [2]:
default_missing = pd._libs.parsers.STR_NA_VALUES  # pandas' built-in NA strings (a private API)
default_missing.add('none')  # also treat the literal string 'none' as missing
data = pd.read_csv('credit.csv', index_col=0, na_values=default_missing)
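The private `pd._libs.parsers.STR_NA_VALUES` import above works, but the same effect is available through the public API: with `keep_default_na=True` (the default), `na_values` extends pandas' built-in NA strings rather than replacing them. A minimal sketch on inline data (the tiny CSV below is made up for illustration):

```python
import io
import pandas as pd

csv = "checking_balance,amount\nnone,1169\nNA,5951\n75,2096\n"
# na_values adds to the default NA strings (keep_default_na=True is the
# default), so both 'none' and 'NA' are parsed as missing.
df = pd.read_csv(io.StringIO(csv), na_values=['none'])
```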
In [3]:
data.head()
Out[3]:
checking_balance months_loan_duration credit_history purpose amount savings_balance employment_length installment_rate personal_status other_debtors residence_history property age installment_plan housing existing_credits default dependents telephone foreign_worker job gender
0 -43.0 6 critical radio/tv 1169 NaN 13 years 4 single NaN 6 years real estate 67 NaN own 2 0 1 2.349340e+09 yes skilled employee male
1 75.0 48 repaid radio/tv 5951 89.0 2 years 2 NaN NaN 5 months real estate 22 NaN own 1 1 1 NaN yes skilled employee female
2 NaN 12 critical education 2096 24.0 5 years 2 single NaN 4 years real estate 49 NaN own 1 0 2 NaN yes unskilled resident male
3 -32.0 42 repaid furniture 7882 9.0 5 years 2 single guarantor 13 years building society savings 45 NaN for free 1 0 2 NaN yes skilled employee male
4 -23.0 24 delayed car (new) 4870 43.0 3 years 3 single NaN 13 years unknown/none 53 NaN for free 2 1 2 NaN yes skilled employee male
In [4]:
data.shape
Out[4]:
(1000, 22)
  • At first look, the data contains only 1000 observations with 22 columns
  • NA values are present in the dataset as well
  • employment_length & residence_history are not in a unified format --> need to convert them to a single unit, either months or years
  • We can derive a phone-availability feature from the telephone column and then drop telephone, as the raw number can't be used as a feature
In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   checking_balance      606 non-null    float64
 1   months_loan_duration  1000 non-null   int64  
 2   credit_history        1000 non-null   object 
 3   purpose               1000 non-null   object 
 4   amount                1000 non-null   int64  
 5   savings_balance       817 non-null    float64
 6   employment_length     938 non-null    object 
 7   installment_rate      1000 non-null   int64  
 8   personal_status       690 non-null    object 
 9   other_debtors         93 non-null     object 
 10  residence_history     870 non-null    object 
 11  property              1000 non-null   object 
 12  age                   1000 non-null   int64  
 13  installment_plan      186 non-null    object 
 14  housing               1000 non-null   object 
 15  existing_credits      1000 non-null   int64  
 16  default               1000 non-null   int64  
 17  dependents            1000 non-null   int64  
 18  telephone             404 non-null    float64
 19  foreign_worker        1000 non-null   object 
 20  job                   1000 non-null   object 
 21  gender                1000 non-null   object 
dtypes: float64(3), int64(7), object(12)
memory usage: 179.7+ KB
  • Both object features and numeric features such as checking_balance, savings_balance, employment_length, and residence_history contain NA values
  • As lightgbm supports data with NA values, we don't need to impute them here
In [6]:
def unify_length(s):
    """Convert strings like '13 years' or '5 months' to a number of years."""
    if not pd.isna(s):
        v = int(s.split(' ')[0])
        if 'years' in s:
            return v
        else:
            return v/12.0
    return 0  # treat missing history as zero length

data['employment_length'] = data.employment_length.apply(unify_length)
data['residence_history'] = data.residence_history.apply(unify_length)
data['phone_availability'] = data.telephone.apply(lambda x: 0 if pd.isna(x) else 1)
data.drop('telephone', axis = 1, inplace = True)
In [7]:
data.describe(percentiles=[.1,.25,.5,.75,.9,.95,.99])
Out[7]:
checking_balance months_loan_duration amount savings_balance employment_length installment_rate residence_history age existing_credits default dependents phone_availability
count 606.000000 1000.000000 1000.000000 817.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000
mean 97.245875 20.903000 3271.258000 781.570379 4.943583 2.973000 6.565833 35.546000 1.407000 0.300000 1.155000 0.404000
std 206.923583 12.058814 2822.736876 3016.983785 5.278104 1.118715 7.802016 11.375469 0.577654 0.458487 0.362086 0.490943
min -50.000000 4.000000 250.000000 0.000000 0.000000 1.000000 0.000000 19.000000 1.000000 0.000000 1.000000 0.000000
10% -41.000000 9.000000 932.000000 13.000000 0.250000 1.000000 0.000000 23.000000 1.000000 0.000000 1.000000 0.000000
25% -23.000000 12.000000 1365.500000 31.000000 1.000000 2.000000 0.333333 27.000000 1.000000 0.000000 1.000000 0.000000
50% 24.000000 18.000000 2319.500000 64.000000 3.000000 3.000000 2.000000 33.000000 1.000000 0.000000 1.000000 0.000000
75% 131.750000 24.000000 3972.250000 128.000000 7.000000 4.000000 13.000000 42.000000 2.000000 1.000000 1.000000 1.000000
90% 262.000000 36.000000 7179.400000 705.200000 14.000000 4.000000 20.000000 52.000000 2.000000 1.000000 2.000000 1.000000
95% 638.250000 48.000000 9162.700000 2795.800000 17.000000 4.000000 22.000000 60.000000 2.000000 1.000000 2.000000 1.000000
99% 934.600000 60.000000 14180.390000 18474.720000 19.000000 4.000000 24.000000 67.010000 3.000000 1.000000 2.000000 1.000000
max 999.000000 72.000000 18424.000000 19972.000000 19.000000 4.000000 24.000000 75.000000 4.000000 1.000000 2.000000 1.000000

Some more important observations here, before we dive into the EDA and pre-processing tasks:

  • The descriptive table shows no implausible or out-of-range values in the dataset
  • At least 25% of customers have a negative checking balance
  • 30% of customers have defaulted
  • Only 40% of customers have a phone number
In [8]:
data.describe(include='O')
Out[8]:
credit_history purpose personal_status other_debtors property installment_plan housing foreign_worker job gender
count 1000 1000 690 93 1000 186 1000 1000 1000 1000
unique 5 10 3 2 4 2 3 2 4 2
top repaid radio/tv single guarantor other bank own yes skilled employee male
freq 530 280 548 52 332 139 713 963 630 690
  • All categorical features look feasible to use as model features

2. EDA - Exploratory Data Analysis¶

In this section, we will explore:

  • Distribution of categorical and numerical features
  • Relationship between target variable and the features

A. Univariate analysis¶

Categorical features¶

In [9]:
def plot_category_dist(col, ax):
    data[col].value_counts().plot(kind = 'bar', facecolor='g', ax=ax)
    for label in ax.get_xmajorticklabels():
        label.set_rotation(30)
        label.set_horizontalalignment("right")
    ax.set_title("{}".format(col), fontsize= 20)
    return ax

f, ax = plt.subplots(3,4, figsize = (22,17))
f.tight_layout(h_pad=10, w_pad=2, rect=[0, 0.03, 1, 0.93])
plt.rc('xtick', labelsize=16)
category_cols = ['credit_history', 'purpose', 'personal_status', 'other_debtors', 'property', 'phone_availability',
                 'installment_plan', 'housing', 'foreign_worker', 'job', 'gender']
k = 0
for i in range(3):
    for j in range(4):
        if k >= len(category_cols):
            break  # no more features to plot in the remaining grid cells
        plot_category_dist(category_cols[k], ax[i][j])
        k += 1
__ = plt.suptitle("Distributions of category features", fontsize= 22)

Overall for the categorical features, there are some observations:

  • More than 95% of customers are foreign workers --> this feature should probably be removed from the feature list, but in this project I will keep it and check its importance score at the end of the notebook
  • Radio/tv, cars and furniture are the most popular loan purposes among our customers
  • 40% of customers have a credit history with either delayed or critical status
  • More than 60% of customers are skilled employees

Numerical features¶

In [10]:
def plot_numeric_dist(col, ax):
    data[col].plot(kind = 'density', ax=ax,color='g')
    ax.set_title("{}".format(col), fontsize= 20)
    return ax

f, ax = plt.subplots(3,4, figsize = (22,15))
f.tight_layout(h_pad=10, w_pad=2, rect=[0, 0.03, 1, 0.93])

numeric_cols = ['checking_balance', 'months_loan_duration', 'amount', 'savings_balance', 'employment_length', 'installment_rate',
                 'residence_history', 'age', 'existing_credits', 'dependents']
k = 0
for i in range(3):
    for j in range(4):
        if k >= len(numeric_cols):
            break
        plot_numeric_dist(numeric_cols[k], ax[i][j])
        k += 1
__ = plt.suptitle("Distributions of numeric features", fontsize= 22)
  • Most customers are around 30 years old
  • Employment length and residence history share the same distribution
  • Loan duration is mostly in the range of 6-25 months

Target analysis¶

In [11]:
sns.countplot(x = data['default'],palette=['tab:green', 'tab:orange']);
In [12]:
data.default.mean()
Out[12]:
0.3
  • 30% of customers have defaulted
  • The data is not too imbalanced

B. Multivariate analysis¶

Categorical features¶

In [13]:
fig, axes = plt.subplots(4, 3, figsize=(15, 18))
fig.tight_layout(h_pad=8, w_pad=2)
plt.rc('xtick', labelsize=8)
data['frequency'] = 0 # a dummy column to refer to

total = len(data)
for col, ax in zip(category_cols, axes.flatten()):
    counts = data.groupby([col, 'default']).count()
    freq_per_total = counts.div(total).reset_index()
    sns.barplot(x=col, y='frequency', hue='default', data=freq_per_total, ax=ax, palette=['tab:green', 'tab:orange'])

    for label in ax.get_xmajorticklabels():
        label.set_rotation(30)
        label.set_horizontalalignment("right")
data.drop('frequency', axis=1, inplace= True)

Key findings

  • Credit history discriminates clearly between defaulted and non-defaulted customers: among customers with a credit history of "fully repaid" or "fully repaid this bank", defaulters outnumber non-defaulters.
  • Customers who took a loan to buy a used car or a radio/tv, or for retraining, are our good customers: most of them did not default.
  • There is no visible difference between defaulted and non-defaulted customers for the foreign worker and job features ==> As mentioned above, since the number of features is quite small (~20), we will keep these features and revisit this observation in the model feature-importance section
In [14]:
fig = px.parallel_categories(data[['purpose', 'credit_history', 'housing', 'gender', 'default']], color="default",
                            color_continuous_scale=px.colors.diverging.Temps)
fig.show()
  • The graph shows the interactions between purpose, credit history, housing, and gender for defaulted and non-defaulted customers
  • Defaulted male customers are strongly associated with housing = own, credit_history = repaid, and
    purpose = car (new) | furniture | radio/tv

Numerical features¶

In [15]:
fig, axes = plt.subplots(3, 4, figsize=(15, 12))
fig.tight_layout(h_pad=3, w_pad=2)
plt.rc('xtick', labelsize=10)
total = len(data)
for col, ax in zip(numeric_cols, axes.flatten()):
    sns.histplot(data, x=col, hue='default', element="poly", ax=ax, palette=['tab:green', 'tab:orange'])
In [16]:
data[data['checking_balance'] <0].default.mean()
Out[16]:
0.4927007299270073
In [17]:
data[data['checking_balance'] >0].default.mean()
Out[17]:
0.35843373493975905
In [18]:
data[data['months_loan_duration'] <30].default.mean()
Out[18]:
0.25921219822109276
In [19]:
data[data['months_loan_duration'] >=30].default.mean()
Out[19]:
0.4507042253521127
In [20]:
data[data['age'] <=25].default.mean()
Out[20]:
0.42105263157894735
In [21]:
data[data['age'] >25].default.mean()
Out[21]:
0.2716049382716049

Key findings

  • The distributions of checking_balance, age, and months_loan_duration are significantly different between defaulted and non-defaulted customers
  • ~49% of customers with a negative checking balance defaulted, while the default rate in the positive-balance pool is only ~36%
  • Customers with a long tenure (long loan duration) are more likely to default: 45% for tenures >= 30 months versus only 26% for tenures < 30 months
  • Young customers aged 25 or younger are riskier than older customers (42% vs 27% default rate)
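The repeated boolean-mask cells above can be generalized: `pd.cut` plus `groupby` gives the default rate per band in one pass. A sketch on a hypothetical toy frame standing in for `data`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real notebook would run this on `data`.
rng = np.random.default_rng(1)
toy = pd.DataFrame({'age': rng.integers(19, 76, size=500),
                    'default': rng.integers(0, 2, size=500)})

# Default rate per age band, replacing repeated data[data.age <= x] masks.
bands = pd.cut(toy['age'], bins=[18, 25, 40, 60, 80])
rate_by_band = toy.groupby(bands, observed=True)['default'].mean()
```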
In [22]:
fig = px.parallel_coordinates(data[['checking_balance', 'age', 'months_loan_duration', 'amount', 'default']], color="default",
                            color_continuous_scale=px.colors.diverging.Temps)
fig.show()
  • The graph shows the interactions between checking balance, age, months loan duration, and amount for defaulted and non-defaulted customers
  • A high loan amount, long loan duration, young age, and negative checking balance are the key factors associated with default
In [23]:
fig = px.imshow(data.corr()*100, text_auto=".1f", aspect="auto")
fig.show()
  • There is no significant correlation between numerical features

3. Feature Transformation and Train/Test Split¶

  • All categorical features will be transformed with label encoding and converted to category dtype; since we are using the lightgbm classifier, this works better than one-hot encoding
  • The data will be split into train/test with a ratio of 80:20
In [24]:
# In practice, the encoder needs to be saved to transform future data
# As this is just a test, I will not store the encoder here
data[category_cols] = data[category_cols].apply(LabelEncoder().fit_transform)

# Convert all categorical columns to category type
data[category_cols] = data[category_cols].astype('category')
In [25]:
# Keep 80% of the data for training set and 20% for the testing set
train, test = train_test_split(data, test_size=0.2, random_state=123456)
In [26]:
# Convert training set and testing set to lightgbm dataset
train_data = lgb.Dataset(train.drop('default', axis=1), label=train.default)
valid_data = lgb.Dataset(test.drop('default', axis=1), label=test.default, reference=train_data)

4. Train and Validate The Model¶

In this section, I will

  • Train the classifier using lightgbm
  • Plot feature importance
  • Check the discriminative power on the train/test sets

I skip the hyperparameter tuning step in this project due to limited time. The parameters were chosen based on my experience with other credit risk models, and the resulting performance is nevertheless good.

In [27]:
params = {
    'objective': 'binary',
    'random_state': 42,
    'metric': 'auc',
    'verbosity': -1,
    'boosting_type': 'gbdt',
    'early_stopping_rounds': 50,
    'learning_rate': 0.004,
    'n_estimators': 500,
    'lambda_l1': 0.9,
    'lambda_l2': 4.5,
    'feature_fraction': 0.38,
    'bagging_fraction': 0.81,
    'bagging_freq': 50,
    'min_child_samples': 35,
}
In [28]:
model = lgb.train(params, train_data,                     
                     valid_sets=[train_data, valid_data],
                     valid_names=['train', 'test'])
[1]	train's auc: 0.680231	test's auc: 0.554228
Training until validation scores don't improve for 50 rounds
[2]	train's auc: 0.790829	test's auc: 0.709272
[3]	train's auc: 0.811118	test's auc: 0.761087
[4]	train's auc: 0.818334	test's auc: 0.781422
[5]	train's auc: 0.819416	test's auc: 0.783088
[6]	train's auc: 0.821899	test's auc: 0.774299
[7]	train's auc: 0.814585	test's auc: 0.769244
[8]	train's auc: 0.811831	test's auc: 0.776597
[9]	train's auc: 0.807324	test's auc: 0.776137
[10]	train's auc: 0.805821	test's auc: 0.773725
[11]	train's auc: 0.810588	test's auc: 0.779699
[12]	train's auc: 0.812154	test's auc: 0.787339
[13]	train's auc: 0.81155	test's auc: 0.787799
[14]	train's auc: 0.817207	test's auc: 0.785846
[15]	train's auc: 0.821294	test's auc: 0.789982
[16]	train's auc: 0.821316	test's auc: 0.795841
[17]	train's auc: 0.82348	test's auc: 0.799747
[18]	train's auc: 0.823574	test's auc: 0.803768
[19]	train's auc: 0.822838	test's auc: 0.803998
[20]	train's auc: 0.827913	test's auc: 0.803768
[21]	train's auc: 0.828506	test's auc: 0.801815
[22]	train's auc: 0.828882	test's auc: 0.803079
[23]	train's auc: 0.827334	test's auc: 0.799977
[24]	train's auc: 0.825546	test's auc: 0.799632
[25]	train's auc: 0.825539	test's auc: 0.799632
[26]	train's auc: 0.828702	test's auc: 0.800092
[27]	train's auc: 0.831707	test's auc: 0.801471
[28]	train's auc: 0.830144	test's auc: 0.801126
[29]	train's auc: 0.82949	test's auc: 0.801126
[30]	train's auc: 0.831632	test's auc: 0.803194
[31]	train's auc: 0.83206	test's auc: 0.804228
[32]	train's auc: 0.832653	test's auc: 0.803883
[33]	train's auc: 0.831774	test's auc: 0.805836
[34]	train's auc: 0.830602	test's auc: 0.805492
[35]	train's auc: 0.829708	test's auc: 0.803653
[36]	train's auc: 0.828424	test's auc: 0.801585
[37]	train's auc: 0.829006	test's auc: 0.803998
[38]	train's auc: 0.830065	test's auc: 0.806641
[39]	train's auc: 0.829359	test's auc: 0.804228
[40]	train's auc: 0.829442	test's auc: 0.805607
[41]	train's auc: 0.82833	test's auc: 0.804343
[42]	train's auc: 0.827909	test's auc: 0.804802
[43]	train's auc: 0.826737	test's auc: 0.802964
[44]	train's auc: 0.827105	test's auc: 0.803998
[45]	train's auc: 0.828968	test's auc: 0.805492
[46]	train's auc: 0.829141	test's auc: 0.805722
[47]	train's auc: 0.829494	test's auc: 0.809972
[48]	train's auc: 0.828525	test's auc: 0.807215
[49]	train's auc: 0.828901	test's auc: 0.806296
[50]	train's auc: 0.829321	test's auc: 0.805722
[51]	train's auc: 0.830118	test's auc: 0.807215
[52]	train's auc: 0.830847	test's auc: 0.806641
[53]	train's auc: 0.830719	test's auc: 0.805607
[54]	train's auc: 0.830215	test's auc: 0.804228
[55]	train's auc: 0.830899	test's auc: 0.806526
[56]	train's auc: 0.831342	test's auc: 0.806526
[57]	train's auc: 0.831853	test's auc: 0.806411
[58]	train's auc: 0.831711	test's auc: 0.804688
[59]	train's auc: 0.831726	test's auc: 0.805951
[60]	train's auc: 0.832146	test's auc: 0.806181
[61]	train's auc: 0.831635	test's auc: 0.80733
[62]	train's auc: 0.832191	test's auc: 0.80687
[63]	train's auc: 0.833461	test's auc: 0.805607
[64]	train's auc: 0.833033	test's auc: 0.806526
[65]	train's auc: 0.833949	test's auc: 0.805607
[66]	train's auc: 0.834107	test's auc: 0.806411
[67]	train's auc: 0.834618	test's auc: 0.805607
[68]	train's auc: 0.834626	test's auc: 0.805951
[69]	train's auc: 0.833927	test's auc: 0.806411
[70]	train's auc: 0.835602	test's auc: 0.804917
[71]	train's auc: 0.835166	test's auc: 0.805377
[72]	train's auc: 0.835828	test's auc: 0.806526
[73]	train's auc: 0.836181	test's auc: 0.809283
[74]	train's auc: 0.837015	test's auc: 0.809743
[75]	train's auc: 0.836827	test's auc: 0.810777
[76]	train's auc: 0.836917	test's auc: 0.810662
[77]	train's auc: 0.837563	test's auc: 0.810662
[78]	train's auc: 0.838082	test's auc: 0.811236
[79]	train's auc: 0.839006	test's auc: 0.811236
[80]	train's auc: 0.838968	test's auc: 0.811926
[81]	train's auc: 0.839178	test's auc: 0.811236
[82]	train's auc: 0.838683	test's auc: 0.811926
[83]	train's auc: 0.839321	test's auc: 0.813074
[84]	train's auc: 0.839599	test's auc: 0.814338
[85]	train's auc: 0.839562	test's auc: 0.815372
[86]	train's auc: 0.839674	test's auc: 0.814798
[87]	train's auc: 0.839464	test's auc: 0.814108
[88]	train's auc: 0.839066	test's auc: 0.813304
[89]	train's auc: 0.840358	test's auc: 0.813879
[90]	train's auc: 0.840673	test's auc: 0.815372
[91]	train's auc: 0.84026	test's auc: 0.815142
[92]	train's auc: 0.840456	test's auc: 0.816062
[93]	train's auc: 0.840583	test's auc: 0.816751
[94]	train's auc: 0.84032	test's auc: 0.816291
[95]	train's auc: 0.841004	test's auc: 0.816636
[96]	train's auc: 0.840719	test's auc: 0.816521
[97]	train's auc: 0.841199	test's auc: 0.817096
[98]	train's auc: 0.841267	test's auc: 0.818934
[99]	train's auc: 0.842431	test's auc: 0.819508
[100]	train's auc: 0.842679	test's auc: 0.819049
[101]	train's auc: 0.84319	test's auc: 0.819393
[102]	train's auc: 0.843829	test's auc: 0.817785
[103]	train's auc: 0.84434	test's auc: 0.818359
[104]	train's auc: 0.84437	test's auc: 0.818015
[105]	train's auc: 0.844993	test's auc: 0.819393
[106]	train's auc: 0.845579	test's auc: 0.819508
[107]	train's auc: 0.846353	test's auc: 0.818819
[108]	train's auc: 0.846075	test's auc: 0.81813
[109]	train's auc: 0.845872	test's auc: 0.8179
[110]	train's auc: 0.846203	test's auc: 0.816521
[111]	train's auc: 0.846759	test's auc: 0.815947
[112]	train's auc: 0.846458	test's auc: 0.815947
[113]	train's auc: 0.847059	test's auc: 0.814798
[114]	train's auc: 0.846443	test's auc: 0.814338
[115]	train's auc: 0.84618	test's auc: 0.813764
[116]	train's auc: 0.845745	test's auc: 0.812615
[117]	train's auc: 0.845084	test's auc: 0.81273
[118]	train's auc: 0.845031	test's auc: 0.81273
[119]	train's auc: 0.844806	test's auc: 0.811926
[120]	train's auc: 0.844558	test's auc: 0.811581
[121]	train's auc: 0.844565	test's auc: 0.810777
[122]	train's auc: 0.845099	test's auc: 0.809972
[123]	train's auc: 0.845587	test's auc: 0.810432
[124]	train's auc: 0.845955	test's auc: 0.809743
[125]	train's auc: 0.845557	test's auc: 0.809398
[126]	train's auc: 0.845069	test's auc: 0.809053
[127]	train's auc: 0.845527	test's auc: 0.808479
[128]	train's auc: 0.845429	test's auc: 0.809743
[129]	train's auc: 0.845857	test's auc: 0.809053
[130]	train's auc: 0.845497	test's auc: 0.808249
[131]	train's auc: 0.844888	test's auc: 0.80779
[132]	train's auc: 0.845527	test's auc: 0.807445
[133]	train's auc: 0.846105	test's auc: 0.807445
[134]	train's auc: 0.846413	test's auc: 0.808249
[135]	train's auc: 0.846939	test's auc: 0.807445
[136]	train's auc: 0.847345	test's auc: 0.807675
[137]	train's auc: 0.847946	test's auc: 0.807215
[138]	train's auc: 0.848494	test's auc: 0.806066
[139]	train's auc: 0.848201	test's auc: 0.805722
[140]	train's auc: 0.848449	test's auc: 0.805262
[141]	train's auc: 0.848863	test's auc: 0.805262
[142]	train's auc: 0.849651	test's auc: 0.804458
[143]	train's auc: 0.849629	test's auc: 0.805262
[144]	train's auc: 0.849839	test's auc: 0.806411
[145]	train's auc: 0.850065	test's auc: 0.805951
[146]	train's auc: 0.849674	test's auc: 0.805377
[147]	train's auc: 0.849426	test's auc: 0.805492
[148]	train's auc: 0.849464	test's auc: 0.805262
[149]	train's auc: 0.849065	test's auc: 0.804802
Early stopping, best iteration is:
[99]	train's auc: 0.842431	test's auc: 0.819508
C:\Users\Admins\anaconda3\lib\site-packages\lightgbm\engine.py:177: UserWarning:

Found `n_estimators` in params. Will use it instead of argument

C:\Users\Admins\anaconda3\lib\site-packages\lightgbm\basic.py:1780: UserWarning:

Overriding the parameters from Reference Dataset.

C:\Users\Admins\anaconda3\lib\site-packages\lightgbm\basic.py:1513: UserWarning:

categorical_column in param dict is overridden.

  • The model reaches its best performance at iteration 99, with a test-set AUC of ~0.82
In [29]:
feature_imp = pd.DataFrame(sorted(zip(model.feature_importance(),model.feature_name())), columns=['Value','Feature'])

plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM Features', fontsize= 18)
plt.tight_layout()
plt.rc('ytick', labelsize=16)
plt.show()
  • The plot shows the model's feature importances
  • As predicted in the EDA section, checking balance and months loan duration are among the top important features
  • Foreign worker, job, and dependents are the least important features for the model
In [30]:
train['predict'] = model.predict(train[model.feature_name()])
test['predict'] = model.predict(test[model.feature_name()])
In [31]:
fpr, tpr, thresholds = metrics.roc_curve(train.default, train.predict)
print('GINI of training set: {}%'.format(round((metrics.auc(fpr, tpr)*2 -1)*100)))
GINI of training set: 68%
In [32]:
fpr, tpr, thresholds = metrics.roc_curve(test.default, test.predict)
print('GINI of testing set: {}%'.format(round((metrics.auc(fpr, tpr)*2 -1)*100)))
GINI of testing set: 64%
  • The performance of the model is good, with a GINI of 64% on the testing set
  • Checking balance and months loan duration are the most important features
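The `roc_curve` route above can be shortened: Gini = 2·AUC − 1, so `roc_auc_score` gives it in one line. A tiny sketch with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.7, 0.8, 0.4, 0.2]

# Gini = 2 * AUC - 1, equivalent to the roc_curve-based computation above.
gini = 2 * roc_auc_score(y_true, y_score) - 1
```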

5. Model explanation, performance and strategy¶

In this section:

  • Explain the model with shap
  • Plot the gain chart and lift chart
  • Check the model shifting with PSI

Shap Values

In [33]:
shap_values = shap.TreeExplainer(model).shap_values(test[model.feature_name()])
shap.summary_plot(shap_values[1], test[model.feature_name()])
C:\Users\Admins\anaconda3\lib\site-packages\shap\explainers\_tree.py:353: UserWarning:

LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray

For the top important features, the model explanation is as follows (I skip the explanation for categorical features, as they can't be shown in the shap plot and need to be visualized separately):

  • A low checking balance increases the probability of default, while a high checking balance or an NA value decreases it
  • As in the EDA section, loans with a longer tenure are more likely to default
  • The lower the savings balance, the higher the probability of default
  • Customers with less experience (a short employment length) are more likely to default
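For the categorical features skipped above, one simple separate view is the average SHAP contribution per category level. The arrays below are hypothetical stand-ins for the notebook's `shap_values[1]` column and the test set's label-encoded `purpose` codes:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for one SHAP column and its category codes.
rng = np.random.default_rng(7)
purpose_codes = rng.integers(0, 4, size=100)
purpose_shap = rng.normal(size=100)

# Mean SHAP per level: positive values push predictions toward default.
mean_shap = (pd.DataFrame({'purpose': purpose_codes, 'shap': purpose_shap})
             .groupby('purpose')['shap'].mean())
```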

Gain chart, Lift chart

In [34]:
kds.metrics.report(test.default, test.predict,plot_style='ggplot')
LABELS INFO:

 prob_min         : Minimum probability in a particular decile
 prob_max         : Maximum probability in a particular decile
 prob_avg         : Average probability in a particular decile
 cnt_events       : Count of events in a particular decile
 cnt_resp         : Count of responders in a particular decile
 cnt_non_resp     : Count of non-responders in a particular decile
 cnt_resp_rndm    : Count of responders if events assigned randomly in a particular decile
 cnt_resp_wiz     : Count of best possible responders in a particular decile
 resp_rate        : Response Rate in a particular decile [(cnt_resp/cnt_cust)*100]
 cum_events       : Cumulative sum of events decile-wise 
 cum_resp         : Cumulative sum of responders decile-wise 
 cum_resp_wiz     : Cumulative sum of best possible responders decile-wise 
 cum_non_resp     : Cumulative sum of non-responders decile-wise 
 cum_events_pct   : Cumulative sum of percentages of events decile-wise 
 cum_resp_pct     : Cumulative sum of percentages of responders decile-wise 
 cum_resp_pct_wiz : Cumulative sum of percentages of best possible responders decile-wise 
 cum_non_resp_pct : Cumulative sum of percentages of non-responders decile-wise 
 KS               : KS Statistic decile-wise 
 lift             : Cumulative Lift Value decile-wise
Out[34]:
decile prob_min prob_max prob_avg cnt_cust cnt_resp cnt_non_resp cnt_resp_rndm cnt_resp_wiz resp_rate cum_cust cum_resp cum_resp_wiz cum_non_resp cum_cust_pct cum_resp_pct cum_resp_pct_wiz cum_non_resp_pct KS lift
0 1 0.331 0.358 0.340 20.0 14.0 6.0 6.4 20 70.0 20.0 14.0 20 6.0 10.0 21.875 31.25 4.412 17.463 2.188
1 2 0.318 0.330 0.325 20.0 15.0 5.0 6.4 20 75.0 40.0 29.0 40 11.0 20.0 45.312 62.50 8.088 37.224 2.266
2 3 0.308 0.318 0.314 20.0 13.0 7.0 6.4 20 65.0 60.0 42.0 60 18.0 30.0 65.625 93.75 13.235 52.390 2.188
3 4 0.303 0.308 0.306 20.0 5.0 15.0 6.4 4 25.0 80.0 47.0 64 33.0 40.0 73.438 100.00 24.265 49.173 1.836
4 5 0.293 0.302 0.297 20.0 8.0 12.0 6.4 0 40.0 100.0 55.0 64 45.0 50.0 85.938 100.00 33.088 52.850 1.719
5 6 0.284 0.292 0.288 20.0 0.0 20.0 6.4 0 0.0 120.0 55.0 64 65.0 60.0 85.938 100.00 47.794 38.144 1.432
6 7 0.275 0.283 0.279 20.0 5.0 15.0 6.4 0 25.0 140.0 60.0 64 80.0 70.0 93.750 100.00 58.824 34.926 1.339
7 8 0.266 0.275 0.271 20.0 2.0 18.0 6.4 0 10.0 160.0 62.0 64 98.0 80.0 96.875 100.00 72.059 24.816 1.211
8 9 0.259 0.266 0.263 20.0 1.0 19.0 6.4 0 5.0 180.0 63.0 64 117.0 90.0 98.438 100.00 86.029 12.409 1.094
9 10 0.246 0.259 0.254 20.0 1.0 19.0 6.4 0 5.0 200.0 64.0 64 136.0 100.0 100.000 100.00 100.000 0.000 1.000
  • The KS of the model peaks at ~53% at decile 5
  • By rejecting the first 5 deciles (highest predicted risk), we would avoid ~86% of defaulters
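The decile-wise KS above can be cross-checked against the threshold-free definition: KS is the maximum gap between the cumulative responder and non-responder rates, i.e. max(TPR − FPR). A sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 1, 0, 1, 1, 0, 0, 1])
y_score = np.array([0.2, 0.8, 0.4, 0.6, 0.3, 0.5, 0.1, 0.7])

# KS statistic: largest vertical gap between the two cumulative curves.
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = (tpr - fpr).max()
```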

Population shifting index (PSI)

In [35]:
train.predict.describe(percentiles = [.2,.4,.6,.8])
Out[35]:
count    800.000000
mean       0.291962
std        0.028907
min        0.240708
20%        0.263450
40%        0.279919
50%        0.289279
60%        0.297193
80%        0.321280
max        0.371669
Name: predict, dtype: float64
In [36]:
def get_quintile(p):
    if p<=0.263450:
        return 1
    elif p<=0.279919:
        return 2
    elif p<=0.297193:
        return 3
    elif p<=0.321280:
        return 4
    else:
        return 5
In [37]:
# As the sample size is small, we should not calculate the PSI at decile level
# Instead, I calculate the PSI at quintile level (20% buckets)
test['Q'] = test.predict.apply(get_quintile)
quintile_table = test.Q.value_counts().reset_index().sort_values('index')
quintile_table.columns = ['Q', 'count_test']
quintile_table['%-test'] = quintile_table['count_test']/test.shape[0]
quintile_table['PSI'] = quintile_table['%-test'].apply(lambda x: (0.2 - x)*math.log(0.2/x))
In [38]:
quintile_table
Out[38]:
Q count_test %-test PSI
4 1 31 0.155 0.011470
1 2 43 0.215 0.001085
2 3 38 0.190 0.000513
0 4 51 0.255 0.013362
3 5 37 0.185 0.001169
In [39]:
quintile_table['PSI'] = quintile_table['PSI']*100
ax = quintile_table.plot(x='Q', y='PSI', ylim=(0,10), color = 'g')
import matplotlib.ticker as mtick
ax.yaxis.set_major_formatter(mtick.PercentFormatter())
plt.rc('ytick', labelsize=10)
  • The model is stable, with a PSI of less than 2% for every quintile
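The quintile computation above can be wrapped into a reusable helper (the `psi` function below is a hypothetical sketch, not part of the notebook; note that the notebook's `(0.2 - x)*log(0.2/x)` term equals the usual `(actual - expected)*log(actual/expected)`):

```python
import math

def psi(expected_pct, actual_pct, eps=1e-6):
    """Population Stability Index between two binned distributions
    (each a list of bucket proportions summing to 1)."""
    total = 0.0
    for e, a in zip(expected_pct, actual_pct):
        e, a = max(e, eps), max(a, eps)  # guard against empty buckets
        total += (a - e) * math.log(a / e)
    return total

# Identical distributions give 0; PSI < 0.1 is usually read as stable.
stable = psi([0.2] * 5, [0.2] * 5)
# The test-set quintile shares from the table above:
shifted = psi([0.2] * 5, [0.155, 0.215, 0.190, 0.255, 0.185])
```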

6. Conclusion¶

  • The model was trained with good performance: a GINI of 64% on the testing set.
  • By rejecting the first 5 deciles (highest predicted risk), we would avoid ~86% of defaulters
  • The model is stable, with a PSI of less than 2%
  • Checking balance, months loan duration, and savings balance are the key factors for predicting defaulters
  • A low checking balance increases the probability of default